Relevant Characteristics Extraction from Semantically Unstructured Data. PhD Thesis Title: "Data Mining for Unstructured Data"

Author

  • Lucian N. VINTAN
Abstract

1 Introduction

Most real-world data collections are in text format. Such data are considered semi-structured because they have only a small amount of organized structure. Work on modeling and implementing semi-structured data in recent databases has grown continually in the last few years. Moreover, information retrieval applications, such as indexing methods for text documents, have been adapted to work with unstructured documents. Traditional information retrieval techniques have become inadequate for searching large amounts of data. Usually, only a small part of the available documents is relevant to the user. Without knowing what the documents contain, it is difficult to formulate effective queries for analyzing and extracting interesting information. Users need tools to compare documents, for example by effectiveness and relevance, or to find patterns that direct them to further documents.

The number of online documents is increasing, and automated document classification is an important task. It is essential to be able to organize such documents into classes automatically so as to facilitate document retrieval and analysis. One possible general procedure for this classification is to take a set of pre-classified documents and use them as the training set. The training set is then analyzed in order to derive a classification scheme. Such a classification scheme often needs to be refined with a testing process. After that, the scheme can be used to classify other online documents. The classification analysis decides which set of attribute-value pairs has the greatest discriminating power in determining the classes. An effective method for document classification is association-based classification, which classifies documents based on a set of associations and frequently occurring text patterns.
Such an association-based classification method proceeds as follows: (1) keywords and terms are extracted using information retrieval and simple association analysis techniques; (2) concept hierarchies of keywords and terms are obtained from available term classes, from expert knowledge, or from keyword classification systems. Documents in the training set can also be classified into class hierarchies. A term-association mining method can then be applied to discover sets of associated terms that maximally distinguish one class of documents from another. This produces a set of association rules for each document class. Such classification rules can be ordered, based on their occurrence frequency and discriminative power, and used to classify new documents. Text classification is a very general process that includes …
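The association-based procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`tokenize`, `mine_rules`, `classify`), the thresholds, and the toy training set are all assumptions made for illustration, and step (2), the concept hierarchy, is omitted. Term sets that are frequent within one class (support) and rarely occur outside it (confidence) become rules; rules are ordered by confidence and support, and the first matching rule labels a new document.

```python
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Crude keyword extraction (assumption): lowercase words longer than 2 chars."""
    return {w for w in text.lower().split() if len(w) > 2}


def mine_rules(train, min_support=0.5, min_confidence=0.8, max_len=2):
    """Mine rules of the form term set -> class from labeled documents.

    support    = fraction of the class's documents containing the term set
    confidence = fraction of all documents containing the term set
                 that belong to the class
    """
    by_class = defaultdict(list)
    for text, label in train:
        by_class[label].append(tokenize(text))
    all_docs = [terms for docs in by_class.values() for terms in docs]

    rules = []
    for label, docs in by_class.items():
        counts = Counter()
        for terms in docs:
            for k in range(1, max_len + 1):
                counts.update(combinations(sorted(terms), k))
        for termset, n in counts.items():
            support = n / len(docs)
            if support < min_support:
                continue
            covered = sum(1 for terms in all_docs if set(termset) <= terms)
            confidence = n / covered
            if confidence >= min_confidence:
                rules.append((frozenset(termset), label, confidence, support))

    # Order rules: higher confidence, then higher support, then longer term sets.
    rules.sort(key=lambda r: (-r[2], -r[3], -len(r[0]), sorted(r[0])))
    return rules


def classify(rules, text, default=None):
    """Return the class of the first (strongest) rule whose terms all occur in text."""
    terms = tokenize(text)
    for termset, label, _, _ in rules:
        if termset <= terms:
            return label
    return default


# Hypothetical pre-classified training set.
train = [
    ("stock market shares fall", "finance"),
    ("market shares rise on bank profit", "finance"),
    ("team wins football match", "sport"),
    ("football team loses final match", "sport"),
]
rules = mine_rules(train)
print(classify(rules, "bank shares and the stock market"))
print(classify(rules, "the football final"))
```

A real system would replace `tokenize` with proper keyword extraction, mine frequent term sets with an algorithm such as Apriori rather than enumerating all combinations, and tune the thresholds on a held-out test set, as the refinement step in the abstract suggests.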

Similar resources

Ontology-driven Information Extraction

Homogeneous unstructured data (HUD) are collections of unstructured documents that share common properties, such as similar layout, common file format, or common domain of values. Building on such properties, it would be desirable to automatically process HUD to access the main information through a semantic layer – typically an ontology – called semantic view. Hence, we propose an ontology-bas...

Application-specific Semantic Information Extraction from Unstructured Data

With the rapid growth of information available on the internet, there is also a growing demand for applications that can process data from different sources, accessing only the information needed for their particular use. Because information on the internet is mostly unstructured and thus cannot be processed automatically, many tasks which require information extraction are still perfo...

Ontology-guided extraction of structured information from unstructured text: Identifying and capturing complex relationships

Many applications call for methods to enable automatic extraction of structured information from unstructured natural language text. Due to the inherent challenges of natural language processing, most of the existing methods for information extraction from text tend to be domain specific. This thesis explores a modular ontology-based approach to information extraction that decouples domain-spec...

Text Mining: Extraction of Interesting Association Rule with Frequent Itemsets Mining for Korean Language from Unstructured Data

Text mining is a specific method for extracting knowledge from structured and unstructured data. The knowledge extracted by the text mining process can be used for further analysis and discovery. This paper presents a method for extracting information from unstructured text data and the importance of Association Rules Mining, specifically for Korean language text, and also NLP (Natural Language ...

Accessing Unstructured Data through Mobile Devices

The paper presents ongoing work on accessing unstructured data on the web through mobile devices. To achieve this, we use Information Extraction (IE) to extract relevant information from unstructured documents. The extracted information is stored in a database, which a user can search by submitting a query from a mobile device. The extracted information that matches wi...

Publication date: 2006